Spark Streaming

Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ or TCP sockets and processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to file systems, databases, and live dashboards. Since Spark Streaming is built on top of Spark, users can apply Spark’s built-in machine learning (MLlib) and graph processing (GraphX) libraries to data streams. Compared to other streaming projects, Spark Streaming has the following features and benefits:
  • Ease of Use: Spark Streaming brings Spark’s language-integrated API to stream processing, letting users write streaming applications the same way as batch jobs, in Java, Python and Scala.
  • Fault Tolerance: Spark Streaming is able to detect and recover from data loss mid-stream due to node or process failure.
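To make the "same way as batch jobs" point concrete, here is a minimal sketch of a streaming word count in Scala. The host, port, application name, and 5-second batch interval are placeholder choices for illustration; the computation itself uses the same high-level operators you would use in a batch job:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        // local[2]: one thread to receive data, one to process it
        val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
        val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

        // Ingest lines of text from a TCP socket (placeholder host/port)
        val lines = ssc.socketTextStream("localhost", 9999)

        // The same high-level operators as a batch job: map, reduce, etc.
        val counts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

        counts.print()        // push each batch's result to stdout

        ssc.start()           // start receiving and processing
        ssc.awaitTermination()
      }
    }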
Example real-time use cases are:
  • Website monitoring, network monitoring
  • Fraud detection
  • Web clicks
  • Advertising
  • Internet of Things sensors
Why Spark Streaming?
Spark Streaming can be used to stream real-time data from different sources, such as Facebook, stock markets, and geographical systems, and run powerful analytics on it to help businesses. There are five significant aspects of Spark Streaming that make it unique:
  • Integration: advanced libraries for graph processing (GraphX), machine learning (MLlib), and SQL can be used on streams without extra plumbing.
  • Combination: streamed data can be processed in conjunction with interactive queries and with static datasets.
  • Load balancing: each micro-batch is split into small tasks that the scheduler distributes dynamically, balancing load across the cluster.
  • Resource usage: Spark Streaming makes efficient use of the available resources.
  • Recovery from stragglers and failures: slow or failed tasks can be relaunched in parallel on other nodes.
How does Spark Streaming Work?
Spark Streaming processes a continuous stream of data by dividing it into micro-batches, exposed through an abstraction called a Discretized Stream, or DStream. A DStream is the API Spark Streaming provides for creating and processing these micro-batches: it is simply a sequence of RDDs, each processed on Spark’s core execution engine like any other RDD. A DStream can be created from any streaming source, such as Flume or Kafka.
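Because each micro-batch surfaces as an ordinary RDD, you can drop down to plain RDD code whenever you need to. A small sketch, reusing the lines DStream from the word-count example above:

    // foreachRDD exposes each micro-batch as a plain RDD[String]
    lines.foreachRDD { (rdd, batchTime) =>
      val count = rdd.count()  // any RDD operation works here
      println(s"Batch at $batchTime contains $count records")
    }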


Data Streaming is a technique for transferring data so that it can be processed as a steady and continuous stream. Streaming technologies are becoming increasingly important with the growth of the Internet.
  • When data arrives continuously as an unbounded sequence, it is called a data stream.
  • Streaming divides continuously flowing input data into discrete units for processing.
  • Stream processing is the low-latency processing and analysis of streaming data. Spark Streaming is aimed at use cases that require significant amounts of data to be processed as soon as they arrive.
Spark Streaming can quickly recover from many kinds of failures and from stragglers.
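Recovery relies on checkpointing: the streaming context periodically writes the metadata it needs to a reliable store, and after a driver failure the context is rebuilt from that checkpoint rather than from scratch. A minimal sketch, assuming a hypothetical HDFS checkpoint directory:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("RecoverableStreamingApp")
      val ssc = new StreamingContext(conf, Seconds(5))
      ssc.checkpoint(checkpointDir)  // persist the metadata needed for recovery
      // ... define DStream sources and operations here ...
      ssc
    }

    // On restart, getOrCreate restores the context from the checkpoint
    // if one exists, instead of building a fresh one
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()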


Spark Streaming supports data sources such as HDFS directories, TCP sockets, Kafka, Flume, Twitter, etc. Data streams can be processed with Spark’s core APIs, DataFrames and SQL, or machine learning APIs, and can be persisted to a filesystem, HDFS, databases, or any data store offering a Hadoop OutputFormat.
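On the output side, a short sketch reusing the counts DStream from the earlier word-count example: saveAsTextFiles writes one output directory per batch, while foreachRDD gives full control for custom sinks such as databases (the path and the write logic below are placeholders):

    // One output directory per batch interval, named from the prefix/suffix
    counts.saveAsTextFiles("hdfs:///output/wordcounts", "txt")

    // For databases or other custom sinks, drop down to the RDD level
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // open one connection per partition here (placeholder)
        records.foreach { case (word, count) =>
          println(s"$word,$count")  // placeholder: write to the external store instead
        }
      }
    }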

How Spark Streaming Works
  • Spark Streaming divides a data stream into batches of X seconds called DStreams, which internally are sequences of RDDs.
  • Your Spark application processes the RDDs using Spark APIs, and the processed results of the RDD operations are returned in batches.
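A short sketch of how that "X seconds" interval interacts with windowing, again assuming the lines DStream and the 5-second batch interval from the first example: window and slide durations must be multiples of the batch interval, and results are still returned batch by batch.

    import org.apache.spark.streaming.Seconds

    // Counts over the last 30 seconds of data, recomputed every 10 seconds;
    // both durations are multiples of the 5-second batch interval
    val windowedCounts = lines.flatMap(_.split(" "))
                              .map(word => (word, 1))
                              .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    windowedCounts.print()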
